Cycle Prevention in Distributed Checkpointing
Authors
Abstract
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, this paper presents a predicate that directs processes to take additional local (forced) checkpoints in order to ensure that no local checkpoint is useless. This predicate has two noteworthy properties: it can be locally evaluated by each process without requiring additional synchronization, and it ensures that as few additional local checkpoints as possible are taken. As this predicate ensures cycle-freeness in the checkpointing graph, it prevents the domino effect in a particularly efficient manner.

Introduction

A local checkpoint is a snapshot of a local state of a process; a global checkpoint is a set of local checkpoints, one from each process; and a consistent global checkpoint is a global checkpoint such that no message sent by a process after its local checkpoint is received by another process before its local checkpoint. So the consistency of global checkpoints strongly depends on the flow of messages exchanged by processes.

The determination of consistent global checkpoints is a fundamental problem in distributed computing and arises in many applications, such as detection of stable properties, determination of breakpoints, detection of unstable properties, and rollback recovery upon failure occurrences.

When processes independently take their local checkpoints, there is a risk that no consistent global checkpoint can ever be formed (except the first one, composed of their initial states). This is caused by the well-known unbounded domino effect. Even if consistent global checkpoints can be formed, it is still possible that some local checkpoints can never be included in a consistent global checkpoint; such local checkpoints are called useless.

To prevent useless checkpoints, and thus safely prevent the domino effect, some coordination in the taking of local checkpoints is required. In the family of
coordinated protocols, processes use additional control messages to synchronize their checkpointing activities. This additional synchronization may result in reduced process autonomy and degraded performance of the underlying application. These drawbacks have given rise to the development of a family of communication-induced checkpointing protocols. In this family, the coordination is achieved by piggybacking control information on application messages: no control messages or synchronization is added to the application. More precisely, processes take local checkpoints independently (these local checkpoints are called basic checkpoints), and the protocol directs them to take additional local checkpoints (called forced checkpoints) to ensure that no local checkpoint becomes useless.

Taking a forced checkpoint before each message delivery is a safe strategy to prevent useless checkpoints, but it is very inefficient. Given a set of basic checkpoints, the fewer forced checkpoints a communication-induced checkpointing protocol takes, the better the protocol. A process decides whether or not to take a forced checkpoint when a message is received by evaluating a predicate. This predicate is based on local control variables of the receiving process and on control values carried by the message. The local control variables managed by a process are a coding of the causal dependencies appearing in its past. Distinct semantics for these control variables and distinct definitions of the predicate give rise to different protocols.

In this paper we present a new predicate that allows the design of a communication-induced checkpointing protocol that takes as few forced checkpoints as possible while ensuring that no local checkpoint is useless. This predicate is based on the Z-path and Z-cycle theory introduced by Netzer and Xu, who showed that a useless checkpoint exactly corresponds to the existence of a Z-cycle in the distributed computation. At the model level, our predicate prevents Z-cycles. The paper derived
from [...] is based on the theory introduced in [...]. It is composed of two main sections. The first presents the model of distributed computations, provides a definition for consistent global checkpoints, and defines Z-paths. The second introduces the predicate that can be used to prevent Z-cycles in the Z-graph.

Distributed Computations, Checkpoints and Z-Paths

Distributed Computations

A distributed computation consists of a finite set $P$ of $n$ processes $\{P_1, P_2, \ldots, P_n\}$ that communicate and synchronize only by exchanging messages. We assume that each ordered pair of processes is connected by an asynchronous, reliable, directed logical channel whose transmission delays are unpredictable but finite. Note that channels are not required to be FIFO. Each process runs on a different processor; processors do not share a common memory, and there is no bound on their relative speeds. Also, they fail according to the fail-stop model.

A process can execute internal, send and delivery statements. An internal statement does not involve communication. When $P_i$ executes the statement "send $m$ to $P_j$", it puts the message $m$ into the channel from $P_i$ to $P_j$. When $P_i$ executes the statement "deliver $m$", it is blocked until at least one message directed to $P_i$ has arrived; then a message is withdrawn from one of its input channels and delivered to $P_i$. Executions of internal, send and delivery statements are modeled by internal, sending and delivery events.

Processes of a distributed computation are sequential; in other words, each process $P_i$ produces a sequence of events $e_{i,1} \ldots e_{i,s} \ldots$. This sequence can be finite or infinite. Every process $P_i$ has an initial local state denoted $\sigma_{i,0}$. The local state $\sigma_{i,s}$ ($s > 0$) results from the execution of the sequence $e_{i,1} \ldots e_{i,s}$ applied to the initial state $\sigma_{i,0}$. More precisely, the event $e_{i,s}$ moves $P_i$ from the local state $\sigma_{i,s-1}$ to the local state $\sigma_{i,s}$. By definition, we say that "$e_{i,x}$ belongs to $\sigma_{j,s}$" (sometimes denoted $e_{i,x} \in \sigma_{j,s}$) if $i = j$ and $x \le s$.

Let $H$ be the set of all the events produced by a distributed computation. This computation is modeled by
the partially ordered set $\widehat{H} = (H, \xrightarrow{hb})$, where $\xrightarrow{hb}$ denotes Lamport's well-known happened-before relation.

Local and Global Checkpoints

Local checkpoints

A local checkpoint $C$ is a recorded state (snapshot) of a process. Not every local state is necessarily recorded as a local checkpoint, so the set of local checkpoints is only a subset of the set of local states.

Definition. A communication and checkpoint pattern is a pair $(\widehat{H}, C_{\widehat{H}})$, where $\widehat{H}$ is a distributed computation and $C_{\widehat{H}}$ is a set of local checkpoints defined on $\widehat{H}$.

$C_{i,x}$ represents the $x$-th local checkpoint of process $P_i$. The local checkpoint $C_{i,x}$ corresponds to some local state $\sigma_{i,s}$ with $x \le s$. Figure [...] shows an example of a checkpoint and communication pattern. We assume that each process $P_i$ takes an initial local checkpoint $C_{i,0}$ (corresponding to $\sigma_{i,0}$) and that after each event a checkpoint will eventually be taken. $P_i$ ...
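The consistency condition stated in the introduction (no message sent after the sender's local checkpoint is received before the receiver's local checkpoint) can be illustrated with a minimal Python sketch. The encoding is an assumption made for illustration only and does not come from the paper: each local checkpoint is represented by the number of events its process has executed when the checkpoint is taken, and the names `Message`, `orphan_messages`, `is_consistent` and `example_messages` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """One application message (hypothetical encoding, not from the paper)."""
    sender: int         # index of the sending process
    send_event: int     # 1-based position of the send event at the sender
    receiver: int       # index of the receiving process
    deliver_event: int  # 1-based position of the delivery event at the receiver


def orphan_messages(cut: Dict[int, int], messages: List[Message]) -> List[Message]:
    """Return the messages that violate consistency for the given global checkpoint.

    cut[i] is the number of events process i has executed when its local
    checkpoint is taken (0 stands for the initial checkpoint).
    A message is an orphan if it is sent after the sender's checkpoint
    but delivered before the receiver's checkpoint.
    """
    return [
        m for m in messages
        if m.send_event > cut[m.sender] and m.deliver_event <= cut[m.receiver]
    ]


def is_consistent(cut: Dict[int, int], messages: List[Message]) -> bool:
    """A global checkpoint is consistent when it has no orphan message."""
    return not orphan_messages(cut, messages)


# Two processes: process 0 sends a message to process 1, which later replies.
example_messages = [
    Message(sender=0, send_event=1, receiver=1, deliver_event=1),
    Message(sender=1, send_event=2, receiver=0, deliver_event=3),
]
```

With this example, the pair of initial checkpoints `{0: 0, 1: 0}` is consistent, while `{0: 0, 1: 1}` is not, because process 1 has already delivered a message that process 0 has not yet sent. The sketch only checks a given global checkpoint; it does not implement the paper's predicate for deciding when to take forced checkpoints.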
Similar resources
Transaction-Consistent Global Checkpoints in a Distributed Database System
Checkpointing and rollback recovery are well-known techniques for handling failures in distributed database systems. In this paper, we establish the necessary and sufficient conditions for the checkpoints on a set of data items to be part of a transaction-consistent global checkpoint of the distributed database. This can throw light on designing efficient, non-intrusive checkpointing techniques...
Efficient Techniques for Adaptive Independent Checkpointing in Distributed Systems
This work presents two novel algorithms to prevent rollback propagation for independent checkpointing: an efficient adaptive independent checkpointing algorithm and an optimized adaptive independent checkpointing algorithm. The last opportunity strategy that yields a better performance than the conservation strategy is also employed to prevent useless checkpoints for both causal rewinding paths...
Review of Some Checkpointing Schemes for Distributed and Mobile Computing Environments
Mr Raman Kumar, Mewar University, Chittorgarh (Raj); Dr Parveen Kumar, Amity University, Gurgaon (Haryana). Abstract: Fault tolerance techniques facilitate systems to carry out tasks in the incidence of faults. A checkpoint is a...
A Checkpointing Protocol Based on a Minimal Characterization of the "No-Z-Cycle" Property
Given a checkpoint and communication pattern of a distributed execution, the "No Z-Cycle" property (NZC) states that there does not exist a dependency between a checkpoint and itself. In other words, there does not exist a non-causal sequence of messages that starts after a checkpoint and terminates before that checkpoint. From an operational point of view, that property corresponds to the fact that ea...
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints
A useless checkpoint corresponds to the occurrence of a checkpoint and communication pattern called Z-cycle. A recent result shows that ensuring a computation without Z-cycles is a particular application of a property, namely Virtual Precedence (VP), defined on an interval-based abstraction of a computation. In this paper we first propose a taxonomy of communication-induced checkpointing protoc...
A New Checkpointing Approach for Mobile Distributed System
In this paper, we introduce a weighted checkpointing approach for the mobile distributed computing system (MDCS) that significantly reduces checkpointing overheads on mobile nodes. Checkpoint protocols proposed so far in the literature for MDCS are either coordinated, log based or quasi-synchronous. Coordinated checkpointing requires extra synchronization messages and may block the underlying c...
Journal title:
Volume, issue:
Pages: -
Publication date: 1997